Bridging weak supervision and privacy aware learning via sufficient statistics
Authors
Abstract
We present a first attempt at connecting two areas of statistical learning that have shared little common ground: weakly supervised learning and privacy-aware learning. In the former, the aim is to learn models of labeled data when full label information is not available; the latter concerns the design of algorithms with privacy guarantees that protect the data, trading off utility for learning. We focus on classification with linear separators. For a broad set of loss functions, it is known that there exists a sufficient statistic, the mean operator, that summarizes all information from the label variable. Learning algorithms have exploited this property to overcome the lack of label knowledge, learning from label proportions only. We extend the result with almost no structural assumptions on loss functions and regularizers, and show how the approach is potentially viable for any weakly supervised task. Further, we consider the label as the only sensitive variable to protect, while the rest of the data is in the public domain. In this scenario, we propose a simple method based on the Laplacian mechanism that obfuscates the mean operator and feeds it to a learning algorithm which (a) enjoys α-label differential privacy, (b) is characterized by a generalization bound under almost no structural assumptions, and (c) can be integrated into a secure data-sharing protocol for learning. Remarkably, some known results are recovered with simplified proofs.
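The two ideas in the abstract, the mean operator as a sufficient statistic for the labels and its obfuscation with Laplace noise, can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the synthetic data, and the noise calibration (the textbook Laplace mechanism with L1 sensitivity computed from the public features) are assumptions made for illustration. The logistic loss is used because its label-dependent part is linear, so the empirical risk depends on the labels only through the mean operator.

```python
import numpy as np


def mean_operator(X, y):
    """Mean operator mu = (1/m) * sum_i y_i * x_i, with labels y_i in {-1, +1}."""
    return (y[:, None] * X).mean(axis=0)


def private_mean_operator(X, y, alpha, rng):
    """Mean operator perturbed by Laplace noise (assumed textbook calibration).

    Flipping a single label y_i moves mu by 2 * x_i / m, so the L1
    sensitivity with respect to the labels is 2 * max_i ||x_i||_1 / m.
    The features X are treated as public, as in the abstract.
    """
    m, d = X.shape
    sensitivity = 2.0 * np.abs(X).sum(axis=1).max() / m
    noise = rng.laplace(loc=0.0, scale=sensitivity / alpha, size=d)
    return mean_operator(X, y) + noise


def logistic_risk(theta, X, y):
    """Ordinary empirical logistic risk, which needs every label."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))


def logistic_risk_from_mu(theta, X, mu):
    """Same risk computed from the features and the mean operator only.

    log(1 + exp(-y z)) = [log(1 + exp(-z)) + log(1 + exp(z))] / 2 - y z / 2,
    and averaging the last term over the sample gives theta . mu / 2.
    """
    z = X @ theta
    even_part = 0.5 * (np.logaddexp(0.0, -z) + np.logaddexp(0.0, z))
    return even_part.mean() - 0.5 * (theta @ mu)


# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)
theta = rng.normal(size=5)

mu = mean_operator(X, y)
mu_private = private_mean_operator(X, y, alpha=1.0, rng=rng)

risk_full = logistic_risk(theta, X, y)
risk_mu = logistic_risk_from_mu(theta, X, mu)
risk_private = logistic_risk_from_mu(theta, X, mu_private)
```

Here `risk_full` and `risk_mu` agree up to floating-point error, while `risk_private` is the label-private surrogate a learner could minimize in place of the true risk. Smaller values of `alpha` add more noise and so trade utility for privacy.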
Related resources
Estimation from Indirect Supervision with Linear Moments
In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to es...
Weak Supervision for Semi-supervised Topic Modeling via Word Embeddings
Semi-supervised algorithms have been shown to improve the results of topic modeling when applied to unstructured text corpora. However, sufficient supervision is not always available. This paper proposes a new process, Weak+, suitable for use in semi-supervised topic modeling via matrix factorization, when limited supervision is available. This process uses word embeddings to provide additional...
Bandit Label Inference for Weakly Supervised Learning
The scarcity of data annotated at the desired level of granularity is a recurring issue in many applications. Significant amounts of effort have been devoted to developing weakly supervised methods tailored to each individual setting, which are often carefully designed to take advantage of the particular properties of weak supervision regimes, form of available data and prior knowledge of the t...
Nursing Students’ Perspectives on Actual and Ideal Support and Supervision in Clinical Learning Environments in Zanjan University of Medical Sciences in 2011
Introduction: The clinical learning environment plays an important role in the clinical learning of nursing students. Any difference between students’ perspectives on the expected and the actual environment may result in decreased clinical learning. Therefore, the present study aimed to compare nursing students’ perspectives on actual and ideal support and supervision in the clinical setting. Methods: In this desc...
Privacy-Preserving Bayesian Network Learning From Heterogeneous Distributed Data
In this paper, we propose a post randomization technique to learn a Bayesian network (BN) from distributed heterogeneous data, in a privacy sensitive fashion. In this case, two or more parties own sensitive data but want to learn a Bayesian network from the combined data. We consider both structure and parameter learning for the BN. The only required information from the data set is a set of su...